Split test/device/random.jl into multiple files #233

Closed
AntonOresten wants to merge 1 commit into JuliaGPU:main from AntonOresten:random-split

@AntonOresten

Collaborator

Most files in the test suite take 30-60 seconds at the high end, but I've noticed that the random number generation tests take a disproportionate amount of time, around 2-5 minutes, and allocate a total of 78 GB on the CPU:

See latest buildkite run (v1.12)

Link: https://buildkite.com/julialang/cutile-dot-jl/builds/400/canvas?jid=019e6496-64bc-4264-8c4f-a61c88f04090&tab=output

                             │   Test   │ ──────────── GPU ───────────── │ ──────────────── CPU ──────────────── │
Test                (Worker) │ time (s) │ GC (s) │ Alloc (MB) │ RSS (MB) │ GC (s) │ GC % │ Alloc (MB) │ RSS (MB) │
device/reductions        (2) │    87.62 │   0.01 │       0.66 │   118.00 │   2.53 │  2.9 │    6598.31 │  1495.75 │
codegen/operations       (2) │    66.72 │   0.00 │       0.00 │    86.00 │   1.60 │  2.4 │    5753.27 │  1670.48 │
examples/fft             (2) │    22.28 │   0.00 │       1.14 │   118.00 │   0.17 │  0.8 │     704.30 │  1808.48 │
codegen/integration      (2) │    21.81 │   0.00 │       0.00 │    86.00 │   0.38 │  1.7 │    2125.03 │  1861.32 │
host/mapreduce           (2) │    31.91 │   0.00 │       0.66 │   120.00 │   0.50 │  1.6 │    2999.61 │  1959.46 │
device/tile              (2) │    27.63 │   0.00 │       1.45 │   120.00 │   0.26 │  0.9 │    1642.97 │  2072.34 │
device/core              (2) │    26.08 │   0.00 │       0.91 │   132.00 │   0.10 │  0.4 │     876.67 │  2262.48 │
examples/fmha            (2) │    12.95 │   0.00 │      17.50 │   132.00 │   0.16 │  1.3 │     945.66 │  2306.39 │
device/random            (1) │   320.51 │   0.01 │       0.94 │   116.00 │   9.74 │  3.0 │   78015.57 │  1436.53 │
examples/moe             (2) │    21.70 │   0.01 │    2126.11 │  1892.00 │   2.83 │ 13.1 │    7984.09 │  4356.64 │
examples/layernorm       (1) │    29.26 │   0.00 │      32.42 │   172.00 │   0.29 │  1.0 │    1676.29 │  1644.81 │
extensions/Microfloats   (1) │     3.59 │   0.00 │       0.00 │   108.00 │   0.00 │  0.0 │     287.52 │  1644.81 │
examples/batchmatmul     (1) │    10.29 │   0.00 │      17.00 │   140.00 │   0.05 │  0.5 │     532.86 │  1695.83 │
device/broadcast         (3) │    37.92 │   0.01 │       0.95 │   118.00 │   0.95 │  2.5 │    1918.63 │  1412.29 │
device/atomics           (1) │    11.97 │   0.00 │       0.00 │   140.00 │   0.67 │  5.6 │    1752.97 │  1752.34 │
examples/vadd            (3) │    11.38 │   0.00 │     125.30 │   182.00 │   0.45 │  4.0 │     938.49 │  1554.71 │
examples/softmax         (1) │     9.41 │   0.00 │     204.75 │   334.00 │   0.52 │  5.6 │    1438.75 │  2012.80 │
host/broadcast           (1) │     6.87 │   0.00 │       0.70 │   142.00 │   0.00 │  0.0 │     397.89 │  2012.80 │
device/control_flow      (3) │    12.83 │   0.00 │       0.00 │   118.00 │   0.04 │  0.3 │     495.82 │  1554.71 │
codegen/reflection       (1) │     8.13 │   0.00 │       0.00 │   110.00 │   0.05 │  0.7 │     705.78 │  2012.80 │
device/hints             (3) │     6.56 │   0.00 │       0.14 │   118.00 │   0.00 │  0.0 │     278.45 │  1554.71 │
device/math              (1) │     3.97 │   0.00 │       0.08 │   142.00 │   0.00 │  0.0 │     184.94 │  2012.80 │
examples/matmul          (3) │     4.56 │   0.00 │      16.62 │   118.00 │   0.00 │  0.0 │     239.98 │  1579.24 │
codegen/assume           (1) │     3.22 │   0.00 │       0.00 │   110.00 │   0.00 │  0.0 │     353.22 │  2012.80 │
device/slice             (1) │     2.20 │   0.00 │       0.00 │   142.00 │   0.00 │  0.0 │     117.85 │  2012.80 │
extensions/DLFP8Types    (3) │     2.98 │   0.00 │       0.00 │    86.00 │   0.00 │  0.0 │     232.52 │  1580.36 │
device/gather_scatter    (1) │     2.37 │   0.00 │       0.03 │   142.00 │   0.00 │  0.0 │     117.18 │  2012.80 │
host/cache               (3) │     1.96 │   0.00 │       0.00 │    86.00 │   0.00 │  0.0 │     133.41 │  1586.13 │
codegen/fpmode           (1) │     2.46 │   0.00 │       0.00 │   110.00 │   0.00 │  0.0 │     218.08 │  2012.80 │
device/views             (3) │     2.88 │   0.00 │       0.00 │   118.00 │   0.00 │  0.0 │     176.41 │  1586.13 │
examples/transpose       (3) │     2.00 │   0.00 │      11.50 │   118.00 │   0.00 │  0.0 │     155.82 │  1586.13 │
device/types             (1) │     3.59 │   0.00 │       0.04 │   142.00 │   0.00 │  0.0 │     181.36 │  2012.80 │
device/print             (3) │     1.92 │   0.00 │       0.00 │   118.00 │   0.00 │  0.0 │     126.68 │  1586.13 │
codegen/slice            (1) │     1.67 │   0.00 │       0.00 │   110.00 │   0.00 │  0.0 │     224.77 │  2012.80 │
codegen/rng_intrinsics   (3) │     2.40 │   0.00 │       0.00 │    86.00 │   0.00 │  0.0 │     189.22 │  1586.50 │
codegen/cse              (1) │     1.40 │   0.00 │       0.00 │   110.00 │   0.00 │  0.0 │     192.07 │  2012.80 │
codegen/views            (3) │     1.07 │   0.00 │       0.00 │    86.00 │   0.00 │  0.0 │     139.52 │  1586.50 │
analysis/dataflow        (1) │     1.17 │   0.00 │       0.00 │   110.00 │   0.00 │  0.0 │      37.84 │  2012.80 │
types                    (3) │     0.93 │   0.00 │       0.00 │    86.00 │   0.00 │  0.0 │      38.63 │  1586.50 │
codegen/bounds           (1) │     0.45 │   0.00 │       0.00 │   110.00 │   0.00 │  0.0 │      50.79 │  2012.80 │
device/integration       (3) │     0.71 │   0.00 │       0.02 │   118.00 │   0.00 │  0.0 │      30.87 │  1586.50 │
codegen/no_wrap          (1) │     0.41 │   0.00 │       0.00 │   110.00 │   0.00 │  0.0 │      52.29 │  2012.80 │
codegen/kernel_state     (3) │     0.30 │   0.00 │       0.00 │    86.00 │   0.00 │  0.0 │      29.15 │  1586.50 │
device/kernel_state      (1) │     0.27 │   0.00 │       0.00 │   142.00 │   0.00 │  0.0 │      13.62 │  2012.80 │

This PR splits them up into multiple files so it can leverage ParallelTestRunner.

I also looked into reusing the broadcasted PHILOX_M / PHILOX_W constant tiles across rounds, which made things 5-10% faster:

function philox2x_round(c1::Tile{UInt32, S}, c2::Tile{UInt32, S},
                        k::Tile{UInt32, S}, m::Tile{UInt32, S}) where {S}
    hi = Intrinsics.mulhii(c1, m)
    lo = Intrinsics.muli(c1, m)
    (hi .⊻ k .⊻ c2, lo)
end

philox2x_bumpkey(k::Tile{UInt32, S}, w::Tile{UInt32, S}) where {S} =
    k .+ w

function philox2x_rounds(::Val{R}, c1::Tile{UInt32, S}, c2::Tile{UInt32, S},
                         k::Tile{UInt32, S}) where {R, S}
    m = fill(PHILOX_M, size(c1))
    w = fill(PHILOX_W, size(c1))
    if R > 0;                               c1, c2 = philox2x_round(c1, c2, k, m); end
    if R > 1; k = philox2x_bumpkey(k, w);   c1, c2 = philox2x_round(c1, c2, k, m); end
    if R > 2; k = philox2x_bumpkey(k, w);   c1, c2 = philox2x_round(c1, c2, k, m); end
    if R > 3; k = philox2x_bumpkey(k, w);   c1, c2 = philox2x_round(c1, c2, k, m); end
    if R > 4; k = philox2x_bumpkey(k, w);   c1, c2 = philox2x_round(c1, c2, k, m); end
    if R > 5; k = philox2x_bumpkey(k, w);   c1, c2 = philox2x_round(c1, c2, k, m); end
    if R > 6; k = philox2x_bumpkey(k, w);   c1, c2 = philox2x_round(c1, c2, k, m); end
    return (c1, c2)
end

So the test time seems to be dominated by codegen rather than by this constant setup, but I don't plan on tackling that here.

@AntonOresten

AntonOresten commented May 27, 2026

Collaborator Author

The failing test is unrelated, but confusing...

EDIT: perhaps related to https://developer.nvidia.com/blog/nvidia-cuda-13-3-enhances-gpu-development-with-tile-programming-in-c-compiler-autotuning-and-python-updates ? Did something get bumped?

Resolved in #234

@AntonOresten

Collaborator Author

random_rand alone still takes nearly 3 minutes:

device/random_randexp    (1) │    82.50 │   0.01 │       0.31 │   116.00 │   2.79 │  3.4 │   15860.37 │  1425.95 │
device/random_rand       (2) │   168.99 │   0.01 │       0.33 │   116.00 │   5.24 │  3.1 │   36928.61 │  1396.73 │
device/random_randn      (1) │   100.27 │   0.00 │       0.31 │   116.00 │   2.88 │  2.9 │   26444.82 │  1460.46 │
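
To put these numbers in context, here is a rough sketch of how the split changes the best-case wall time on three workers. It assumes a greedy longest-first assignment, which is not necessarily how ParallelTestRunner schedules files, and it only includes the files above 20 s from the first table. The function name `lpt_makespan` is made up for this illustration.

```julia
# Greedy longest-processing-time (LPT) scheduling: hand each job, longest
# first, to the currently least-loaded worker and return the largest load.
# This is an illustrative model, not ParallelTestRunner's actual scheduler.
function lpt_makespan(times::Vector{Float64}, nworkers::Int)
    loads = zeros(Float64, nworkers)
    for t in sort(times; rev=true)
        loads[argmin(loads)] += t
    end
    return maximum(loads)
end

# Test times (s) of the files above 20 s in the first table,
# excluding device/random.
others = [87.62, 66.72, 37.92, 31.91, 29.26, 27.63, 26.08, 22.28, 21.81, 21.70]

random_single = [320.51]                # device/random as one file
random_split  = [82.50, 168.99, 100.27] # randexp, rand, randn after the split

before = lpt_makespan(vcat(others, random_single), 3)
after  = lpt_makespan(vcat(others, random_split), 3)
println("single file: ", before, " s, split: ", after, " s")
```

With a single file, the model is pinned at the 320.51 s of device/random no matter how the other files are distributed. After the split the floor is the 168.99 s of device/random_rand, so any further gain has to come from making that file itself cheaper.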

@maleadt

maleadt commented May 28, 2026

Member

Maybe use subdirs? device/random/randn reads more nicely IMO.

@maleadt

maleadt commented May 28, 2026

Member

Actually, I'll try optimizing the compiler first. Maybe this isn't necessary.

maleadt added a commit that referenced this pull request May 28, 2026
As observed in #233, compilation of large kernels is slow. The culprit turned out to be the user/use lookup, which walks the entire IR every time since our IR doesn't encode use-def chains (as opposed to LLVM or MLIR). To avoid this cost, introduce a Rewriter abstraction that caches these lookups while invalidating them upon insertion, mutation, etc. Once again modeled after MLIR.
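
The idea in that commit message can be sketched on a toy IR. This is not the actual cuTile.jl implementation: `Inst`, `Rewriter`, `users`, `users_slow`, and `replace_operand!` are hypothetical names, and the real Rewriter may invalidate more selectively than dropping the whole cache as done here.

```julia
# Toy IR: instruction i defines value i and lists the values it uses.
# All names in this block are hypothetical.
struct Inst
    op::Symbol
    operands::Vector{Int}
end

# Uncached lookup: walks the entire IR on every call.
users_slow(insts::Vector{Inst}, v::Int) =
    [i for (i, inst) in enumerate(insts) if v in inst.operands]

# Rewriter that caches the value => users map and drops it on mutation.
mutable struct Rewriter
    insts::Vector{Inst}
    cache::Union{Nothing, Dict{Int, Vector{Int}}}
end
Rewriter(insts::Vector{Inst}) = Rewriter(insts, nothing)

function users(rw::Rewriter, v::Int)
    if rw.cache === nothing
        cache = Dict{Int, Vector{Int}}()
        for (i, inst) in enumerate(rw.insts), o in inst.operands
            us = get!(cache, o, Int[])
            # Record instruction i once even if it uses o several times.
            (isempty(us) || last(us) != i) && push!(us, i)
        end
        rw.cache = cache   # one IR walk serves all later lookups
    end
    return get(rw.cache, v, Int[])
end

# Every mutation goes through the rewriter so the cache can be invalidated.
function replace_operand!(rw::Rewriter, i::Int, old::Int, new::Int)
    replace!(rw.insts[i].operands, old => new)
    rw.cache = nothing
    return rw
end

insts = [Inst(:const, Int[]), Inst(:const, Int[]),
         Inst(:add, [1, 2]), Inst(:mul, [3, 1])]
rw = Rewriter(insts)
users_before = users(rw, 1)      # instructions 3 and 4 use value 1
replace_operand!(rw, 4, 1, 2)    # instruction 4 now uses value 2 instead
users_after = users(rw, 1)       # cache was dropped, so this is rebuilt
```

Repeated `users` calls between mutations cost one dictionary lookup instead of a full IR walk, which is where the saving for large kernels would come from in this model.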
@maleadt maleadt closed this May 28, 2026
@AntonOresten AntonOresten deleted the random-split branch May 28, 2026 14:48